Ascend NPU Environment Configuration Guide

Last updated: 04/27/2026.

This document describes the key environment variables for running ROLL on Huawei Ascend NPU, covering device management, HCCL communication, memory optimization, CPU scheduling, vLLM-Ascend inference, and debugging.

Environment Variables Set by ROLL

ROLL automatically injects the following environment variables at runtime (defined in roll/platforms/npu.py):

| Variable | Value | Description |
| --- | --- | --- |
| ASCEND_RT_VISIBLE_DEVICES | e.g. "0,1,2,3" | Controls NPU device visibility, analogous to CUDA_VISIBLE_DEVICES for GPU |
| RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES | "1" | Prevents Ray from overriding ASCEND_RT_VISIBLE_DEVICES |
| VLLM_ALLOW_INSECURE_SERIALIZATION | "1" | Allows vLLM to use insecure serialization for cross-process tensor transfer via Ray |
| RAY_get_check_signal_interval_milliseconds | "1" | Reduces Ray plasma lock hold time to avoid lock starvation under multi-worker load |
| RAY_CGRAPH_get_timeout | "600" | Timeout in seconds for fetching results from a Ray Compiled Graph |

Docker Image Environment Variables

The pre-built Ascend images described in the Ascend NPU Docker Usage Guide include the following environment settings:

| Variable | Value | Description |
| --- | --- | --- |
| ASCEND_HOME_PATH | /usr/local/Ascend/ascend-toolkit/latest | CANN toolkit root path |
| LD_LIBRARY_PATH | Includes multiple Ascend lib64 paths | Dynamic library search path; ensures that libascendcl.so and related libraries can be loaded |

The following CANN environment scripts are automatically sourced via /root/.bashrc in the pre-built images:

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

Ray Cluster Environment Variables (Multi-Node)

These variables control how ROLL forms a Ray cluster across multiple NPU nodes. They are defined in roll/distributed/scheduler/driver_utils.py and consumed by roll/distributed/scheduler/initialize.py:

| Variable | Default | Description |
| --- | --- | --- |
| RANK | 0 | Node rank. 0 = head node; 1, 2, 3, ... = worker nodes |
| WORLD_SIZE | 1 | Total number of nodes in the cluster |
| MASTER_ADDR | 127.0.0.1 | IP address of the head node |
| MASTER_PORT | 6379 | Ray head node port (6379 is also Ray's default port) |
| DASHBOARD_PORT | 8265 | Ray dashboard web UI port |
| WORKER_ID | <MASTER_ADDR>:<RANK> | Node name used in the Ray cluster; derived automatically if not set |

When RANK=0, ROLL automatically runs ray start --head --port=<MASTER_PORT>. When RANK>0, ROLL sleeps 5 seconds, then runs ray start --address=<MASTER_ADDR>:<MASTER_PORT> to join the cluster. After all nodes have joined, the ROLL process on each worker node exits (sys.exit(0)), leaving only the head node to execute the training pipeline.
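The branching described above can be sketched in shell. This is an illustration of the documented behavior only, not ROLL's actual implementation: the function names ray_start_command and default_worker_id are hypothetical, and the script only prints the command instead of running Ray.

```shell
#!/usr/bin/env bash
# Sketch only: prints the `ray start` command implied by the rules above.
# It does not start Ray and is not taken from ROLL's source code.

ray_start_command() {
  local rank="${RANK:-0}"
  local master_addr="${MASTER_ADDR:-127.0.0.1}"
  local master_port="${MASTER_PORT:-6379}"

  if [ "$rank" -eq 0 ]; then
    # Head node: start a new cluster on MASTER_PORT.
    echo "ray start --head --port=${master_port}"
  else
    # Worker node: join the cluster at MASTER_ADDR:MASTER_PORT.
    echo "ray start --address=${master_addr}:${master_port}"
  fi
}

# Default node name, following the <MASTER_ADDR>:<RANK> pattern from the table.
default_worker_id() {
  echo "${WORKER_ID:-${MASTER_ADDR:-127.0.0.1}:${RANK:-0}}"
}

ray_start_command
default_worker_id
```

Printing the command on each node before launching is a cheap way to confirm that RANK, MASTER_ADDR, and MASTER_PORT are set the way you expect.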

Example (head node, set before launching the pipeline):

export RANK=0
export WORLD_SIZE=2
export MASTER_ADDR=10.0.0.1
export MASTER_PORT=6379
export DASHBOARD_PORT=8265

Example (worker node, set before joining):

export RANK=1
export WORLD_SIZE=2
export MASTER_ADDR=10.0.0.1
export MASTER_PORT=6379

You can also pre-start Ray manually (ray start --head / ray start --address=...) before running ROLL. ROLL will detect the existing cluster and skip auto-start.

HCCL Communication Variables

These variables control the behavior of HCCL (Huawei Collective Communication Library), the distributed communication backend for NPU (equivalent to NCCL on GPU):

| Variable | Recommended Value | Description |
| --- | --- | --- |
| HCCL_CONNECT_TIMEOUT | 3600 | Link establishment timeout in seconds (default 120). Increase for large model training |
| HCCL_EXEC_TIMEOUT | 3600 | Collective operation execution timeout in seconds. Increase for long-running training steps |
| HCCL_DETERMINISTIC | false | Disables deterministic computation. Enabling it significantly reduces communication performance |
| HCCL_OP_EXPANSION_MODE | "AIV" | Where the communication algorithm is dispatched. AIV runs on the Vector Core and outperforms the AI_CPU, HOST, and HOST_TS modes |
| HCCL_BUFFSIZE | e.g. "2147483648" | HCCL communication buffer size in bytes. Increase for large data volume scenarios |
| HCCL_IF_IP | Node's IP address | IP address used by HCCL for inter-node communication. Required for multi-node training |
| HCCL_SOCKET_IFNAME | e.g. "enp194s0f0" | Network interface name for HCCL socket communication. Must be consistent across all nodes |
| HCCL_IF_BASE_PORT | e.g. 23456 | Base port for HCCL inter-node communication. Ensure the ports are not blocked by a firewall |
| HCCL_WHITELIST_DISABLE | 1 | Disables the HCCL whitelist check. May be needed when communication errors occur in certain environments |

Example (single-node):

export HCCL_CONNECT_TIMEOUT=3600
export HCCL_DETERMINISTIC=false
export HCCL_OP_EXPANSION_MODE="AIV"

Example (multi-node):

export HCCL_CONNECT_TIMEOUT=3600
export HCCL_EXEC_TIMEOUT=3600
export HCCL_DETERMINISTIC=false
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$(hostname -I | awk '{print $1}')
export HCCL_SOCKET_IFNAME="enp194s0f0"
export HCCL_IF_BASE_PORT=23456
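On hosts with several network interfaces, the first address printed by hostname -I is not guaranteed to belong to the interface named in HCCL_SOCKET_IFNAME. The sketch below derives both variables from a single interface name so that they stay consistent. The helper names first_ipv4 and set_hccl_iface are hypothetical, and the sketch assumes the iproute2 ip command is available on the node.

```shell
#!/usr/bin/env bash
# Extract the first IPv4 address from `ip -o -4 addr show` output on stdin.
first_ipv4() {
  awk '$3 == "inet" { split($4, parts, "/"); print parts[1]; exit }'
}

# Hypothetical helper: set both HCCL variables from one interface name.
set_hccl_iface() {
  local iface="$1" addr
  addr="$(ip -o -4 addr show dev "$iface" | first_ipv4)"
  if [ -z "$addr" ]; then
    echo "no IPv4 address found on ${iface}" >&2
    return 1
  fi
  export HCCL_SOCKET_IFNAME="$iface"
  export HCCL_IF_IP="$addr"
}

# Usage on a real node (interface name taken from the example above):
# set_hccl_iface enp194s0f0
```

Keeping the parser separate from the ip call makes it possible to check the parsing against captured output on a machine without that interface.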

NPU Memory Variables

| Variable | Recommended Value | Description |
| --- | --- | --- |
| NPU_MEMORY_FRACTION | 0.96 | Fraction of NPU memory available for use (default 0.8). Increase to 0.95+ for large model inference |
| PYTORCH_NPU_ALLOC_CONF | expandable_segments:True | Enables expandable segments in the PyTorch NPU memory pool, reducing memory fragmentation and OOM risk |
| MULTI_STREAM_MEMORY_REUSE | 1 | Enables multi-stream memory reuse to reduce memory footprint |
| TASK_QUEUE_ENABLE | 2 | Task dispatch optimization. Set to 2 for non-graph mode, 1 for graph mode |
| COMBINED_ENABLE | 1 | Enables operator combination optimization, which fuses multiple small operators into larger ones to reduce kernel launch overhead |

Example:

export NPU_MEMORY_FRACTION=0.96
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export MULTI_STREAM_MEMORY_REUSE=1
export TASK_QUEUE_ENABLE=2
export COMBINED_ENABLE=1
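Because TASK_QUEUE_ENABLE depends on the execution mode, a launch script can pick the value from a single switch instead of hard-coding it. The following is a minimal sketch under stated assumptions: NPU_MODE and both function names are hypothetical and are not read by ROLL or CANN, and "eager" is simply this sketch's label for non-graph mode.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an execution mode to TASK_QUEUE_ENABLE using the
# rule from the table (2 for non-graph mode, 1 for graph mode).
task_queue_level() {
  case "$1" in
    graph) echo 1 ;;
    eager) echo 2 ;;   # "eager" is this sketch's name for non-graph mode
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Accept only fractions in the range (0, 1]; awk does the float comparison.
valid_memory_fraction() {
  awk -v f="$1" 'BEGIN { exit !(f + 0 > 0 && f + 0 <= 1) }'
}

NPU_MODE="${NPU_MODE:-eager}"   # illustrative variable, not part of ROLL or CANN
export TASK_QUEUE_ENABLE="$(task_queue_level "$NPU_MODE")"
valid_memory_fraction "${NPU_MEMORY_FRACTION:-0.96}" \
  || echo "NPU_MEMORY_FRACTION is outside (0, 1]" >&2
```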

CPU Scheduling Variables

| Variable | Recommended Value | Description |
| --- | --- | --- |
| CPU_AFFINITY_CONF | 2 | CPU core affinity optimization to avoid cross-NUMA memory access. 1 = coarse-grained, 2 = fine-grained (recommended) |
| OMP_NUM_THREADS | 1 | OpenMP thread count. Set to 1 in distributed training to avoid over-subscription |

Example:

export CPU_AFFINITY_CONF=2
export OMP_NUM_THREADS=1

Custom per-NPU affinity is also supported:

export CPU_AFFINITY_CONF=1,npu0:0-1,npu1:2-3,npu2:4-5,npu3:6-7
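Writing the per-NPU mapping by hand becomes error-prone as the device count grows. The sketch below generates the same string format as the example above, assuming each NPU gets an equally sized, contiguous core range. The function name build_affinity_conf is hypothetical, and contiguous ranges may not match your NUMA layout, so compare the result with lscpu output before using it.

```shell
#!/usr/bin/env bash
# Build a CPU_AFFINITY_CONF string that gives each NPU a contiguous core range.
# Arguments: number of NPUs, cores per NPU (both chosen by the caller).
build_affinity_conf() {
  local npus="$1" cores_per_npu="$2"
  local conf="1" i=0 first last
  while [ "$i" -lt "$npus" ]; do
    first=$((i * cores_per_npu))
    last=$((first + cores_per_npu - 1))
    conf="${conf},npu${i}:${first}-${last}"
    i=$((i + 1))
  done
  echo "$conf"
}

export CPU_AFFINITY_CONF="$(build_affinity_conf 4 2)"
echo "$CPU_AFFINITY_CONF"   # prints 1,npu0:0-1,npu1:2-3,npu2:4-5,npu3:6-7
```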

vLLM-Ascend Inference Variables

| Variable | Recommended Value | Description |
| --- | --- | --- |
| VLLM_USE_V1 | 1 | Enables the vLLM V1 architecture. Required for vLLM-Ascend |
| VLLM_ATTENTION_BACKEND | XFORMERS | vLLM attention computation backend |
| VLLM_ASCEND_ENABLE_FLASHCOMM | 1 | Enables the Ascend FlashComm high-speed communication optimization |
| VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE | 1 | Enables dense computation optimization for large model inference |
| VLLM_ASCEND_ENABLE_PREFETCH_MLP | 1 | Enables MLP layer weight prefetching |
| VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE | 1 | Enables TopK operator fusion optimization for generation decoding |
| VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE | 1 | Prints prefill/decode phase timing details (for debugging) |
| VLLM_ASCEND_TRACE_RECOMPILES | 1 | Traces operator recompilation for debugging performance issues |
| VLLM_ENABLE_MC2 | 1 | Enables MC2 communication optimization for multi-node inference |

Example:

export VLLM_USE_V1=1
export VLLM_ATTENTION_BACKEND=XFORMERS
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1

CANN Logging & Debugging Variables

| Variable | Recommended Value | Description |
| --- | --- | --- |
| ASCEND_GLOBAL_LOG_LEVEL | 3 (ERROR) | CANN log level: 0=DEBUG, 1=INFO, 2=WARNING, 3=ERROR |
| ASCEND_SLOG_PRINT_TO_STDOUT | 1 | Prints CANN logs to stdout (for debugging) |
| ASDOPS_LOG_LEVEL | ERROR | Operator library log level |
| ATB_LOG_LEVEL | ERROR | ATB acceleration library log level |
| ASCEND_LAUNCH_BLOCKING | 1 | Enables synchronous execution for error localization. Set to 1 only when debugging NPU errors, as it disables async execution and severely degrades performance |

Caution: Leaving debug/info log levels enabled in production will significantly degrade performance. Always set log levels to ERROR for production workloads.

Example (debugging):

export ASCEND_GLOBAL_LOG_LEVEL=0
export ASCEND_SLOG_PRINT_TO_STDOUT=1
export ASCEND_LAUNCH_BLOCKING=1

Example (production):

export ASCEND_GLOBAL_LOG_LEVEL=3
export ASDOPS_LOG_LEVEL=ERROR
export ATB_LOG_LEVEL=ERROR
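Debug settings exported in an interactive session stay in effect until they are changed or unset, so switching back to production by hand is easy to get wrong. The following sketch wraps the two examples above in one function. The name set_cann_log_profile is hypothetical, and unsetting the debug-only variables in production mode is a choice made for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: switch between the debugging and production
# settings listed above.
set_cann_log_profile() {
  case "$1" in
    debug)
      export ASCEND_GLOBAL_LOG_LEVEL=0
      export ASCEND_SLOG_PRINT_TO_STDOUT=1
      export ASCEND_LAUNCH_BLOCKING=1
      ;;
    production)
      export ASCEND_GLOBAL_LOG_LEVEL=3
      export ASDOPS_LOG_LEVEL=ERROR
      export ATB_LOG_LEVEL=ERROR
      # Clear debug-only switches that may linger from an earlier session.
      unset ASCEND_SLOG_PRINT_TO_STDOUT ASCEND_LAUNCH_BLOCKING
      ;;
    *)
      echo "usage: set_cann_log_profile debug|production" >&2
      return 1
      ;;
  esac
}

set_cann_log_profile production
```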

CANN Operator Compilation & Precision Variables

| Variable | Recommended Value | Description |
| --- | --- | --- |
| ACL_OP_COMPILER_CACHE_MODE | enable | Enables the operator compilation cache to avoid recompilation on repeated runs |
| ACL_OP_COMPILER_CACHE_DIR | e.g. /tmp/npu_cache | Directory in which the operator compilation cache is stored |
| ASCEND_MAX_OP_CACHE_SIZE | e.g. 5000 | Maximum operator cache size. Increase to prevent performance degradation from cache eviction during long training runs |
| ACL_PRECISION_MODE | allow_fp32_to_fp16 | Allows automatic FP32-to-FP16 precision conversion for operators that do not support FP32 |

Example:

export ACL_OP_COMPILER_CACHE_MODE=enable
export ACL_OP_COMPILER_CACHE_DIR=/tmp/npu_cache
export ASCEND_MAX_OP_CACHE_SIZE=5000
export ACL_PRECISION_MODE=allow_fp32_to_fp16
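The cache settings only help if the cache directory is usable. As a minimal sketch, the helper below creates the directory and exports the two cache variables only when it is writable. The function name enable_op_cache is hypothetical, and whether the toolkit would create a missing directory on its own is not stated in this guide, so the sketch creates it explicitly.

```shell
#!/usr/bin/env bash
# Create the cache directory if needed and enable the cache only when the
# directory is writable. The path argument is the caller's choice.
enable_op_cache() {
  local dir="${1:-/tmp/npu_cache}"
  if mkdir -p "$dir" 2>/dev/null && [ -w "$dir" ]; then
    export ACL_OP_COMPILER_CACHE_MODE=enable
    export ACL_OP_COMPILER_CACHE_DIR="$dir"
  else
    echo "cannot write to ${dir}; operator cache left unconfigured" >&2
    return 1
  fi
}

enable_op_cache /tmp/npu_cache
```

A directory under /tmp is usually cleared on reboot, so a persistent path may be preferable if the cache should survive restarts.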

Single-Node

For single-node multi-NPU distributed RL training, add the following to your startup script, or set the equivalent keys under system_envs in your ROLL YAML config:

# HCCL communication
export HCCL_CONNECT_TIMEOUT=3600
export HCCL_EXEC_TIMEOUT=3600
export HCCL_DETERMINISTIC=false
export HCCL_OP_EXPANSION_MODE="AIV"

# NPU memory
export NPU_MEMORY_FRACTION=0.96
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export MULTI_STREAM_MEMORY_REUSE=1
export TASK_QUEUE_ENABLE=2
export COMBINED_ENABLE=1

# CPU scheduling
export CPU_AFFINITY_CONF=2
export OMP_NUM_THREADS=1

# vLLM-Ascend inference
export VLLM_USE_V1=1
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1

# Operator compilation cache
export ACL_OP_COMPILER_CACHE_MODE=enable
export ACL_OP_COMPILER_CACHE_DIR=/tmp/npu_cache
export ASCEND_MAX_OP_CACHE_SIZE=5000

# Logging (production)
export ASCEND_GLOBAL_LOG_LEVEL=3
export ASDOPS_LOG_LEVEL=ERROR
export ATB_LOG_LEVEL=ERROR

Multi-Node

For multi-node training, add the Ray cluster variables and the multi-node HCCL variables on top of the single-node configuration:

# Ray cluster (multi-node)
export RANK=0 # 0=head, 1/2/3=worker
export WORLD_SIZE=2 # Total number of nodes
export MASTER_ADDR=10.0.0.1 # Head node IP
export MASTER_PORT=6379 # Ray communication port
export DASHBOARD_PORT=8265 # Ray dashboard port

# HCCL multi-node communication
export HCCL_CONNECT_TIMEOUT=3600
export HCCL_EXEC_TIMEOUT=3600
export HCCL_DETERMINISTIC=false
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$(hostname -I | awk '{print $1}')
export HCCL_SOCKET_IFNAME="enp194s0f0"
export HCCL_IF_BASE_PORT=23456

# ... (rest of NPU memory, CPU, vLLM, cache, logging variables as above)

Or configure via ROLL YAML:

system_envs:
  HCCL_CONNECT_TIMEOUT: "3600"
  HCCL_EXEC_TIMEOUT: "3600"
  HCCL_DETERMINISTIC: "false"
  HCCL_OP_EXPANSION_MODE: "AIV"
  HCCL_IF_IP: "10.0.0.1"
  HCCL_SOCKET_IFNAME: "enp194s0f0"
  HCCL_IF_BASE_PORT: "23456"
  NPU_MEMORY_FRACTION: "0.96"
  PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
  CPU_AFFINITY_CONF: "2"
  OMP_NUM_THREADS: "1"
  COMBINED_ENABLE: "1"
  VLLM_USE_V1: "1"
  ACL_OP_COMPILER_CACHE_MODE: "enable"
  ACL_OP_COMPILER_CACHE_DIR: "/tmp/npu_cache"
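If the same settings already exist as shell exports, the YAML block can be generated from them rather than retyped. The sketch below prints a system_envs mapping with every value quoted as a string, matching the YAML above. The function name emit_system_envs is hypothetical, and the sketch assumes values contain no double quotes or newlines.

```shell
#!/usr/bin/env bash
# Print a system_envs YAML block for the named variables, reading their
# current values from the shell. Unset or empty variables are skipped.
emit_system_envs() {
  local name value
  echo "system_envs:"
  for name in "$@"; do
    # Indirect lookup; variable names are supplied by the caller, not by users.
    eval "value=\${$name-}"
    [ -n "$value" ] || continue
    printf '  %s: "%s"\n' "$name" "$value"
  done
}

export HCCL_CONNECT_TIMEOUT=3600
export OMP_NUM_THREADS=1
emit_system_envs HCCL_CONNECT_TIMEOUT OMP_NUM_THREADS
```

The output can be pasted into the ROLL YAML config, which keeps the shell and YAML variants of a configuration from drifting apart.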

Disclaimer

The Ascend support provided in ROLL is intended as a reference example. For production use, please consult official channels.